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Chapter  6 

Robust  Order  Statistics 
based  Ensembles  for 
Distributed  Data  Mining 


Kagan  Turner  and  Joydeep  Ghosh 


6.1  Mining  of  Distributed  Data  Sources 

An  implicit  assumption  in  traditional  statistical  pattern  recognition  and  ma¬ 
chine  learning  algorithms  is  that  the  data  to  be  used  for  model  development 
is  available  as  a  single  flat  file.  This  assumption  is  valid  for  virtually  all  pop¬ 
ular  benchmark  datasets  such  as  those  available  from  ELENA,  Statlog  or  the 
UCI  machine  learning  repository.  Such  datasets  are  small  or  medium  sized, 
requiring  a  few  megabytes  at  most.  Thus  the  algorithms  typically  also  assume 
that  the  entire  data  can  fit  in  main  memory,  and  do  not  address  computational 
issues  regarding  scalability  and  “out-of-core”  operations. 

The  tremendous  explosion  in  the  amount  of  data  gathering  and  warehous¬ 
ing  in  the  past  few  years  has  generated  very  large  and  complex  databases. 
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Any  effort  in  mining  information  from  such  databases  has  to  address  the  fact 
that  (1)  data  may  be  kept  in  several  files  as  in  interlinked  relational  databases, 
and  information  needed  for  decision  making  may  be  spread  over  more  than 
one  file.  For  example,  the  concept  of  “collective  data  mining”  (see  chapter 
5)  explicitly  addresses  “vertical  partitioning”  situations  where  the  features  or 
variables  relevant  to  a  classification  decision  are  spread  over  multiple  files, 
each  accessible  to  only  one  classifier;  (2)  the  files  may  be  spread  across  sev¬ 
eral  disks  or  even  across  different  geographical  locations,  and  (3)  the  statistical 
quality  of  data  may  vary  widely.  For  example  the  percentage  of  cases  involv¬ 
ing  financial  or  health-care  fraud  varies  in  different  regions,  and  so  does  the 
amount  of  missing  information. 

One  can  argue  that  by  transfering  all  data  to  a  single  warehouse  and  per¬ 
forming  a  series  of  merges  and  joins,  one  can  get  a  single  (albeit  very  large), 
flat  file.  A  traditional  algorithm  can  be  used  after  randomizing  and  subsam¬ 
pling  this  file.  But  in  real  applications  this  approach  may  not  be  feasible  be¬ 
cause  of  the  computational,  bandwidth  and  storage  costs.  In  certain  cases, 
it  may  not  even  be  possible  for  a  variety  of  practical  reasons  including  se¬ 
curity,  privacy,  proprietary  nature  of  data,  need  for  fault  tolerant  distribution 
of  data  and  services,  real-time  processing  requirements,  statutary  constraints 
imposed  by  law,  etc.  (see  chapter  3).  Then  there  are  two  options.  If  the 
owners  of  the  individual  databases  are  willing  to  provide  high  level  or  sum¬ 
mary  information/decisions  such  as  local  classification  estimates,  and  transmit 
this  information  to  a  central  location,  then  a  metaleamer  can  be  applied  to  the 
component  decisions  to  come  up  with  a  final,  composite  decision.  Note  that 
such  high  level  information  not  only  has  reduced  storage  and  bandwidth  re¬ 
quirements,  but  also  maintains  the  privacy  of  individual  records  (DuMouchel 
1999.  Otherwise  one  has  to  resort  to  a  distributed  computing  framework  such 
as  the  emerging  field  of  collective  intelligence  (COIN),  wherein  techniques 
are  developed  such  that  local  and  independent  computations  can  still  increase 
a  desired  global  utility  function  (Wolpert  and  Turner  1999). 

The  first  option  leads  to  several  issues  reminiscent  of  studies  in  decision 
fusion  (Dasarathy  1994)  applied  largely  to  multi-sensor  fusion  and  distributed 
control  problems.  It  is  also  related  to  the  theory  of  ensembles  (Sharkey  1996) 
and  integration  of  multiple  learned  models  (Chan,  Stolfo,  and  Wolpert  1996). 
But  there  are  substantial  new  aspects  that  need  to  be  addressed  in  a  distributed 
data  mining  context. 

This  chapter  is  rooted  in  the  ensemble  framework  and  shows  how  order 
statistics  can  be  used  in  the  design  of  a  “metaleamer”  that  examines  the  out¬ 
puts  of  multiple  distributed  classifers  and  provides  a  final  decision.  Order 
statistics  is  one  of  the  key  tools  of  robust  statistics,  tailored  to  handling  data 
with  outliers.  In  a  distributed  data  mining  scenario  in  which  there  is  wide 
variability  among  the  individual  classifers  because  of  the  underlying  quality 
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of  the  local  data  that  they  examine,  a  metaleamer  should  be  able  to  tolerate 
a  few  outlier  classifier  results.  The  robust  properties  of  order  statistics  based 
approaches  such  as  median  filtering  and  m-estimators  (Arnold,  Balakrishnan, 
and  Nagaraja  1992),  have  been  observed  in  many  disciplines.  Thus  they  are 
an  obvious  candidate  for  metaleaming  in  such  environments. 

The  next  section  provides  a  brief  review  of  the  metaleaming  framework  for 
classification  to  put  the  proposed  techniques  in  perspective.  Section  6.3  sum¬ 
marizes  the  relationship  between  classifier  errors  and  decision  boundaries  and 
provides  the  necessary  background  for  mathematically  analyzing  order  statis¬ 
tic  combiners  (Turner  and  Ghosh  1996a).  Section  6.4  introduces  simple  order 
statistic  combiners.  Based  on  these  concepts,  in  section  6.5  we  propose  two 
powerful  combiners,  trim  and  spread ,  and  derive  the  amount  of  error  reduc¬ 
tion  associated  with  each.  In  section  6.6  we  present  the  performance  of  order 
statistic  combiners  on  several  datasets.  Section  6.7  discusses  the  implications 
of  using  linear  combinations  of  order  statistics  as  a  strategy  for  pooling  the 
outputs  of  individual  classifiers. 


6.2  A  Brief  History  of  Multi-Learner  Systems 

The  idea  of  integrating  multiple  models  for  the  same  problem,  has  been  exam¬ 
ined  for  a  long  time.  The  main  goal  is  to  obtain  a  better  composite  global 
model,  with  more  accurate  and  reliable  estimates  or  decisions.  Some  no¬ 
table  early  systems  include  Selffidge’s  Pandemonium  (Selfridge  1958)  where 
a  head-deamon  would  select  the  deamon  that  “shouted  the  loudest,”  and  Nils¬ 
son’s  committee  machines.  A  strong  motivation  for  such  systems  was  voiced 
by  Kanal  in  his  classic  1974  paper  (Kanal  1974): 

“It  is  now  recognized  that  the  key  to  pattern  recognition  problems 
does  not  lie  wholly  in  learning  machines,  statistical  approaches, 
spatial,  filtering,  ...,  or  in  any  other  particular  solution  which  has 
been  vigorously  advocated  by  one  or  another  group  during  the  last 
one  and  a  half  decades  as  the  solution  to  the  pattern  recognition 
problem.  No  single  model  exists  for  all  pattern  recognition  prob¬ 
lems  and  no  single  technique  is  applicable  to  all  problems.  Rather 
what  we  have  is  a  bag  of  tools  and  a  bag  of  problems.” 

In  the  late  1970s,  much  work  was  done  on  combining  linguistic  and  statistical 
models,  and  on  combining  heuristic  search  with  statistical  pattern  recognition. 
Subsequently,  similar  sentiments  on  the  importance  of  multiple  approaches 
were  also  voiced  in  the  AI  community  (Minsky  1991): 

‘To  solve  really  hard  problems,  we’ll  have  to  use  several  different 
representations.  ...  It  is  time  to  stop  arguing  over  which  type 
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of  pattern-classification  technique  is  best.  ...  Instead  we  should 
work  at  a  higher  level  of  organization  and  discover  how  to  build 
managerial  systems  to  exploit  the  different  virtues  and  evade  the 
different  limitations  of  each  of  these  ways  of  comparing  things.” 

Integration  of  multiple  data  sources  and  /  or  learned  models  can  now  be 
found  in  several  disciplines,  for  example,  the  combining  of  estimators  in  econo¬ 
metrics  (Granger  1989),  evidences  in  rule-based  systems  (Barnett  1981)  and 
multi-sensor  data  fusion  (Dasarathy  1994).  Of  course,  one  can  find  numerous 
examples  in  the  human  central  nervous  system  (Shepherd  1979),  as  well  as  in 
some  large  engineering  systems  such  as  those  that  demand  fault  tolerance  or 
employing  control  mechanisms  that  may  need  to  function  in  different  operat¬ 
ing  regimes  (Narendra,  Balakrishnan,  and  Ciliz  1995).  In  particular,  multiple 
models  for  nonlinear  control  has  a  long  tradition.1  Hybridization  in  a  broader 
sense  is  seen  in  efforts  to  combine  two  or  more  of  neural  network,  Bayesian, 
GA,  fuzzy  logic  and  knowledge-based  systems  (Aggarwal,  Ghosh,  Nair,  and 
Taha  1996, Taha  and  Ghosh  1997).  The  goal  is  again  to  incorporate  diverse 
sources  and  forms  of  information  and  to  exploit  the  somewhat  complementary 
nature  of  different  methodologies. 

Figure  6.1  shows  a  generic  diagram  of  an  ensemble ,  the  simplest  and  most 
well  understood  type  of  multi-learner  systems.  Each  component  learner  is  a 
regressor  or  classifier,  trying  to  solve  the  same  task.  While  data  ultimately 
originates  from  an  underlying  universal  set  X,  each  learner  may  receive  some¬ 
what  different  subsets  of  the  data  for  “training”  or  parameter  estimation  (as  in 
bagging  [Breiman,  1994]  and  boosting  [Drucker,  Cortes,  Jackel,  LeCun,  and 
Vapnik  1994]),  and  may  be  using  different  feature  extractors  (/s)  on  the  same 
raw  data.  For  example,  in  our  earlier  work  on  sonar  classification  (Ghosh, 
Deuser,  and  Beck  1992),  Fourier,  wavelet  and  autoregressive  coefficients  ex¬ 
tracted  from  the  same  preprocessed  time  series  were  used  respectively  for  three 
different  classifier  types. 

Along  with  selection  of  training  samples  and  feature  extractors,  one  needs 
to  decide  what  types  of  learners  to  use  and  how  many,  and  finally,  how  to 
design  the  metaleamer.  There  are  also  larger  issues  of  how  to  train  the  com¬ 
ponents  given  that  they  are  part  of  a  bigger  system,  and  to  estimate  the  overall 
gains  achievable. 

The  simplest  metaleamer  is  the  combiner ,  where  the  output  y  is  deter¬ 
mined  solely  from  the  outputs  of  the  individual  learners.  In  the  past  few  years, 
a  host  of  experimental  results  from  both  neural  network  and  machine  learning 
communities  show  that  combining  the  outputs  of  multiple  regressors  or  classi¬ 
fiers  via  (weighted)  averaging,  majority  vote,  product  rule,  entropy,  etc.,  pro- 

1  See  www.itk.ntnu.no/ansatte/Johansen.  Tor.Ame/mmamc/address.html  for  a  detailed  list  of 
researchers  in  this  area. 
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Figure  6.1:  Generic  architecture  of  a  multi-learner  system  for  regression  or 
classification 


vides  statistically  significant  improvement  in  performance  along  with  tighter 
confidence  intervals.  Moreover,  theoretical  analysis  has  been  developed  for 
both  regression  (Perrone  1993,  Hashem  and  Schmeiser  1993)  and  classifica¬ 
tion  (Turner  and  Ghosh  1996a,  1999),  to  estimate  the  gains  achievable. 

Beyond  simple  combiners  are  metaleaming  methods  such  as  arbitration 
(Chan  and  Stolfo  1997)  and  stacking  (Wolpert  1992).  Also  in  this  category 
are  divide-and-conquer  approaches,  where  relatively  simple  learners  special¬ 
ize  in  different  parts  of  the  input-output  space,  and  the  total  model  is  a  (pos¬ 
sibly  soft)  union  of  such  simpler  models.  Techniques  of  modular  learning 
include  “mixtures-of-experts,”  local  linear  regression,  CART/MARS,  adap¬ 
tive  subspace  models,  etc.  (Jordan  and  Jacobs  1994;  Ramamurti  and  Ghosh 
1999;  Holmstrom,  Koistinen,  Laaksonen,  and  Oja  1997).  What  distinguishes 
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all  these  models  from  simple  combiners  is  that  the  metaleamer’s  actions  now 
also  depends  on  the  current  input  and/or  the  target  values,  i.e.  the  dotted  lines 
in  figure  6.1  are  also  used. 

The  simple  combining  methods  are  best  suited  for  problems  where  the 
individual  classifiers  perform  the  same  task,  and  have  comparable  success. 
However,  such  combiners  are  more  susceptible  to  outliers  and  to  unevenly 
performing  classifiers.  On  the  other  hand,  the  more  general  metaleamers  are 
conceptually  more  powerful  but  are  vulnerable  to  all  the  problems  associated 
with  the  added  learning  (e.g.,  overparameterizing,  lengthy  training  time).  They 
may  also  require  access  to  the  entire  input  database  which  may  be  impractical 
in  a  distributed  data  mining  environment. 

The  performance  of  classifier  ensembles  has  been  spectacular.  Brieman 
calls  the  combination  of  decision  tree  classifiers  with  boosting  the  most  sig¬ 
nificant  development  in  classifier  design  in  this  decade.  Indeed,  ensemble 
methods  are  even  becoming  available  in  commercial  data  mining  tools  such 
as  SAS  Enterprise  Miner  Version  3.0.  To  see  intuitively  why  ensembles  have 
been  so  effective,  note  that  different  types  of  classifiers  have  different  “in¬ 
ductive  biases’’  (Geman,  Bienenstock,  and  Doursat  1992;  Mitchell  1997),  and 
thus,  in  general,  they  do  not  generalize  in  identical  ways  even  when  they  are 
trained  on  the  same  data  set,  and  have  comparable  performance  on  a  test  set. 
Traditionally,  the  classifier  perceived  as  the  “best,”  as  indicated  by  a  suitable 
scoring  function  such  as  a  cross-validation  based  estimation  of  the  true  gener¬ 
alization  error,  is  selected.  This  means  that  the  other  classifiers  that  had  been 
designed  during  the  model  exploration/development  phase,  get  discarded.  But 
this  results  in  a  potential  loss  of  useful  information  and  effort.  An  ensemble 
can  effectively  make  use  of  such  complementary  information  to  reduce  model 
variance  (Perrone  1993,  Turner  and  Ghosh  1999)  and  in  certain  situations  it 
also  reduces  bias  as  shown  by  the  theory  of  large  margin  classifiers  (Bartlett 
and  Shawe-Taylor,  1998).  It  works  best  when  each  learner  is  well  trained, 
but  different  learners  generalize  in  different  ways,  i.e.,  there  is  diversity  in 
the  ensemble  (Krogh  and  Vedelsby  1995).  Diversity  may  be  induced  through 
different  presentations  of  the  input  data,  as  in  bagging,  variations  in  learner 
design,  or  by  adding  a  penalty  to  the  ouputs  to  encourage  diversity. 

For  large  data  sets,  there  is  also  a  computational  reason  for  using  met¬ 
aleamers,  namely,  it  can  make  certain  inductive  learners  fairly  scalable  to  large 
data  sets  (Chan  and  Stolfo  1997;  Provost  and  Kolluri  1999;  also  see  chapter  1 
in  this  book).  In  Chan  and  Stolfo  (1997)  each  component  classifier  uses  a  ran¬ 
domly  chosen  subset  of  the  entire  data  set,  and  the  decision  tree  classifiers  can 
operate  in  parallel.  This  proves  quite  effective  as  overall  computation  can  be 
greatly  reduced  with  little  loss  in  performance.  A  beneficial  side-effect  of  us¬ 
ing  smaller  datasets  for  decision  tree  classifiers  is  that  the  classifiers  obtained 
are  smaller  in  size  (Oates  and  Jensen  1998).  These  two  as  well  as  other  scala- 
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bility  techniques  for  decision  tree  based  classifiers  are  covered  in  a  nice  recent 
survey  (Provost  and  Kolluri  1999). 

A  fundamental  assumption  in  all  the  multi-classifier  approaches  mentioned 
above  is  that  the  designer  has  access  to  the  entire  data  set,  which  can  be  used 
in  its  entirety,  resampled  in  a  random  (bagging)  or  weighted  (boosting)  way 
(Breiman  1999),  or  randomly  partitioned  and  distributed.  Thus,  except  for 
boosting  situations,  each  classifier  sees  training  data  of  comparable  quality.  If 
the  individual  classifiers  are  then  appropriately  chosen  and  trained  properly, 
their  performances  will  be  (relatively)  comparable  in  any  region  of  the  prob¬ 
lem  space.  So  gains  from  combining  are  derived  from  the  diversity  among 
classifiers  rather  that  by  compensating  for  weak  members  of  the  pool. 

This  assumption  is  clearly  invalid  for  distributed  data  mining  using  het¬ 
erogenous  sites  (see  chapter  5).  Such  real-life  conditions  often  result  in  a  pool 
of  classifiers  that  may  have  significant  variations  in  their  overall  performance. 
Moreover,  they  may  lead  to  conditions  where  individual  classifiers  have  simi¬ 
lar  average  performance,  but  substantially  different  performance  over  different 
parts  of  the  input  space.  In  such  cases,  combining  is  still  desirable,  but  neither 
simple  combiners  nor  metaleamers  are  particularly  well-suited  for  the  type 
of  problems  that  arise.  For  example,  the  simplicity  of  averaging  the  classi¬ 
fier  outputs  is  appealing,  but  the  prospect  of  one  poor  classifier  corrupting  the 
combiner  makes  this  a  risky  choice.  Weighted  averaging  of  classifier  outputs 
appears  to  provide  some  flexibility  (Hashem  and  Schmeiser  1993,  Merz  and 
Pazzani  1997).  Unfortunately,  the  weights  are  still  assigned  on  a  per  classi¬ 
fier  basis  rather  than  a  per  sample  or  per  class  basis.  If  a  classifier  is  accurate 
only  in  certain  areas  of  the  input  space,  this  scheme  fails  to  take  advantage 
of  the  variable  accuracy  of  the  classifier  in  question.  Using  a  metaleamer  that 
provides  different  weights  for  different  patterns  can  potentially  solve  this  prob¬ 
lem,  but  at  a  considerable  cost.  Also,  as  explained  earlier,  the  off-line  training 
of  a  metaleamer  using  substantial  amount  of  data  outputted  by  geographically 
distributed  classifiers,  may  not  be  feasible  or  even  allowable.  In  addition  to 
providing  robustness  demanded  in  such  situations,  the  order  statistic  combin¬ 
ers  presented  in  this  work  also  aim  at  bridging  the  gap  between  simplicity  and 
generality  by  allowing  the  flexible  selection  of  classifiers  without  the  associ¬ 
ated  cost  of  training  metaclassifiers. 


6.3  Error  Characterization  in  a  Single  Classifier 

In  this  section  we  summarize  the  approach  and  results  of  Turner  and  Ghosh 
(1996a,  1999), 2  that  quantify  the  effect  of  inaccuracies  in  estimating  a  poste¬ 
rior  class  probabilities  on  the  classification  error  for  a  single  classifier.  This 

2  This  and  other  related  papers  can  be  downloaded  from  www.lans.ece.utexas.edu. 
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Figure  6.2:  Error  regions  associated  with  approximating  the  a  posteriori  prob¬ 
abilities  (Tugh  1996). 


background  is  needed  to  characterize  and  understand  the  impact  of  order  statis¬ 
tics  combiners,  as  described  in  sections  6.3  and  6.4. 

It  is  well  known  that,  given  one-of-L  desired  outputs  and  sufficient  training 
samples  reflecting  the  class  priors,  the  outputs  of  certain  classifiers  trained 
to  minimize  a  mean  square  or  cross-entropy  error  criteria,  approximate  the 
a  posteriori  probability  densities  of  the  corresponding  classes  (Richard  and 
Lippmann  1991;  Ruck,  Rogers,  Kabrisky,  Oxley,  and  Suter  1990).  Based  on 
this  result,  one  can  model  the  zth  output  of  the  mth  such  classifier  as: 


=  +  €*(*)•  (6.1) 

where  pi  (*)  is  the  true  posterior  for  zth  class  on  input  x ,  and  (x)  is  the  error 
of  the  mth  classifier  in  estimating  that  posterior. 

Now,  let  us  decompose  the  error  into  two  parts:  €™(;t)  =  -f  ??  •”(*). 
The  first  component  does  not  vary  with  the  input,  and  provides  an  offset,  or 
systematic  error  for  each  class.  The  second  component  gives  the  variability 
from  that  systematic  error,  for  each  x  in  each  class,  and  has  zero  mean  and 
variance  a*  These  two  components  of  the  error  are  similar  to  the  bias  and 
variance  decomposition  for  a  quadratic  loss  function  given  in  Geman,  Bienen- 
stock,  and  Doursat  (1992),  although  they  are  at  the  individual  input  level.  We 
will  therefore  refer  to  classifiers  as  “biased”  and  “unbiased”  implying  ft™  ^  0 
for  some  k,  m,  and  =  0,  Vfc,  m,  respectively.  Let  If1  denote  the  offset  be¬ 
tween  the  ideal  class  boundary,  x*  (based  on  pi(x)  =  pj{x))  and  the  realized 
boundary,  x™  (based  on  f™{x)  —  /"*(*)),  as  shown  in  figure  6.2  (Turner  and 
Ghosh  1996a).  This  boundary  offset  (bm  =  x™  —  x*)  has  mean  and  variance 
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given  respectively  by: 


= 


am  om 

Pj  zh 


and 


2  2 

,  °V?(x)  +<X<M 

abm  = - 3 - , 


(6.2) 


(6.3) 


where  s  —  pUx *)  —  p\{x*)  as  introduced  in  Turner  and  Ghosh  (1996a). 

Let  us  further  denote  the  probability  density  function  of  this  boundary  off¬ 
set  by  fb(x).  The  expected  model  error  associated  with  the  selection  of  a 
particular  classifier  m,  can  then  be  expressed  as: 


/oo 

A(b)Mb)db, 

-OO 


(6.4) 


where  A(b)  =  /** +b  (pj(x)  —  pt(x))  dx  is  the  error  due  to  the  selection  of 
a  particular  decision  boundary.  In  general,  it  is  not  possible  to  obtain  the 
density  function  for  the  boundary  offset  without  making  assumptions  on  the 
distributions  of  the  errors.  However,  a  first  order  approximation,  derived  in 
Turner  and  Ghosh  (1996a),  leads  to: 

/OO  1 

-b2sfh(b)db.  (6.5) 

-OO  L 

Let  us  define  the  first  and  second  moments  of  the  boundary  offset  as  follows: 

/OO  /’OO 

x fb(x)dxandM2  =  I  x2fb(x)dx. 

-OO  J  —  OO 


If  the  individual  classifiers  are  unbiased,  the  offset  bm  of  a  single  classifier  has 
.Adi  =  0  and  AA'i  —  o-^n,  leading  to: 


r?tn  _ 

^ model 


sMi 

2 


S<Jbm 

~F‘ 


(6.6) 


Now,  if  the  classifiers  are  biased,  the  variance  of  b  is  left  unchanged  (given 
by  Equation  6.3),  but  the  mean  becomes  fi  =  —  ■  In  other  words,  we  have 

Mi  =  fim  and  abm  =  M2  —  M\2,  leading  to  the  following  model  error: 


Kodem  =  Sjj1  =  + (n2).  (6.7) 

To  emphasize  the  distinction  between  biased  and  unbiased  classifiers,  the  model 
error  will  be  given  as  a  function  of  /3  for  biased  classifiers.  A  more  detailed 
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derivation  of  class  boundaries  and  error  regions  is  presented  in  Turner  and 
Ghosh  (1996a).  For  analyzing  the  error  regions  after  combining  and  compar¬ 
ing  them  to  the  single  classifier  case,  one  needs  to  determine  how  the  first  and 
second  moments  of  the  boundary  distributions  are  affected  by  combining.  The 
following  sections  focus  on  obtaining  those  values  for  order  statistics  based 
combiners. 


6.4  Combining  Multiple  Classifiers  through 
Order  Statistics 

AUTHOR:  PLEASE  INSERT  AT  LEAST  TWO  SENTENCES  BETWEEN 
THE  HEADING  AND  SUBHEADING.  HEADINGS  CANNOT  DIRECTLY 
FOLLOW  ONE  ANOTHER. 

6.4.1  Basic  Concepts 

In  this  section,  we  briefly  discuss  some  basic  concepts  and  properties  of  or¬ 
der  statistics.  Let  X  be  a  random  variable  with  probability  density  function 
/*(*),  and  cumulative  distribution  function  Fx(').  Let  (Xj,  X2i  •  ■  • ,  XaO  be 
a  random  sample  drawn  from  this  distribution.  Now,  let  us  arrange  them  in 
nondecreasing  order,  providing: 

X\:N  <  X2:N  <  •  *  *  <  Xm:N- 

The  ith  order  statistic  denoted  by  X,-:yv,  is  the  ith  value  in  this  progression. 
The  cumulative  distribution  function  for  the  smallest  and  largest  order  statistic 
can  be  obtained  by  noting  that: 

FXN:N(x)  =  P(XN:N  <x)  =  n tlP(Xi:N  <x)  =  [Fx(x)f 


and: 


FxunM  =  P(X UN  <  X)  =  1  -  P(XUN  >  x) 


=  i-n  *LiP(Xf.N>x) 

=  1  -  (1  -  n£j  P(Xi:N  <  x)  =  1  -  [I  -  Fx(x)f 

The  corresponding  probability  density  functions  can  be  obtained  from  these 
equations.  In  general,  for  the  ith  order  statistic,  the  cumulative  distribution 
function  gives  the  probability  that  exactly  i  of  the  chosen  X’s  are  less  than  or 
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equal  to  x.  The  probability  density  function  of  X,;/v  is  then  given  by  David 
(1970): 

/x, ■:*(*>  =  vxwr1 n  -  **<*>]""'  /*(*>•  (6-8) 

This  general  form  however,  cannot  always  be  computed  in  closed  form.  There¬ 
fore,  obtaining  the  expected  value  of  a  function  of  x  using  equation  6.8  is  not 
always  possible.  However,  the  first  two  moments  of  the  density  function  are 
widely  available  for  a  variety  of  distributions  (Arnold,  Balakrishnan,  and  Na- 
garaja  1992).  These  moments  can  be  used  to  compute  the  expected  values  of 
certain  specific  functions,  e.g.,  polynomials  of  order  less  than  two. 


6,4.2  Combining  Unbiased  Classifiers 
through  Order  Statistics 

Now,  let  us  turn  our  attention  to  order  statistics  (OS)  combiners.  For  a  given 
input  x ,  let  the  network  outputs  of  each  of  the  N  classifiers  for  each  class  i  be 
ordered  in  the  following  manner: 

fw<fw<-<r  (*>• 

Then  one  constructs  the  /cth  order  statistic  combiner,  by  selecting  the  k\h 
ranked  output  for  each  class  (//:yV(x)),  as  representing  its  posterior. 

In  particular,  max,  med  and  min  combiners  are  defined  as  follows: 


ftmaxM 

f,medW 

f,min(x) 


Z-.N 

S?  (x)+fJ 
..  .  2 


if  N  is  even 
if  N  is  odd, 


(6.9) 

(6.10) 

(6.11) 


These  three  combiners  are  relevant  because  they  represent  important  qualita¬ 
tive  interpretations  of  the  output  space.  Selecting  the  maximum  combiner  is 
equivalent  to  selecting  the  class  with  the  highest  posterior.  Indeed,  since  the 
network  outputs  approximate  the  class  a  posteriori  distributions,  selecting  the 
maximum  reduces  to  selecting  the  classifier  that  is  the  most  “certain”  of  its 
decision.  The  drawback  of  this  method  however  is  that  it  can  be  compromised 
by  a  single  classifier  that  repeatedly  provides  high  values.  The  selection  of 
the  minimum  combiner  follows  a  similar  logic,  but  focuses  on  classes  that  are 
unlikely  to  be  correct,  rather  than  on  the  correct  class.  Thus,  this  combiner 
eliminates  less  likely  classes  by  basing  the  decision  on  the  lowest  value  for  a 
given  class.  This  combiner  suffers  from  the  same  ills  as  the  max  combiner. 
However,  it  is  less  dependent  on  a  single  error,  since  it  performs  a  min-max 
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operation,  rather  than  a  max-max3.  The  median  classifier  on  the  other  hand 
considers  the  most  “typical”  representation  of  each  class.  For  highly  noisy 
data,  this  combiner  is  more  desirable  than  either  the  min  or  max  combiners 
since  the  decision  is  not  compromised  as  much  by  a  single  large  error. 

The  analysis  that  follows  does  not  depend  on  the  particular  order  statistic 
chosen.  Therefore,  we  will  denote  all  OS  combiners  by  f£5(x)  and  derive  the 
model  error,  E™odel.  The  network  output  provided  by  fgs(x)  is  given  by: 

/"(*)  =  Pk(x)  +  €?  (*),  (6.12) 

Let  us  first  investigate  the  zero- bias  case  (fa  =  0,  VJfc),  where  we  get 
^s(x)  =  r]^5{x).  Proceeding  as  in  section  6.3,  the  boundary  bos  is  shown 
to  be: 


bos  = 


ri?(xb)-n0js(xb) 

s 


(6.13) 


For  Ltd.  r]k  s,  the  first  two  moments  will  be  identical  for  each  class.  Moreover, 
taking  the  order  statistic  will  shift  the  mean  of  both  7i°s  and  rfis  by  the  same 
amount,  leaving  the  mean  of  the  difference  unaffected.  Therefore,  bos  will 
have  zero  mean,  and  variance: 


Jb°s 


24 

s 2 


2aoi 

*1 k  7 

— —  =  <**»*> 


(6.14) 


where  a  is  a  reduction  factor  that  depends  on  the  order  statistic  and  on  the 
distribution  of  b.  For  most  distributions,  a  can  be  found  in  tabulated  form 
(Arnold,  Balaknshnan,  and  Nagaraja  1992).  For  example,  table  6.1  provides 
a  values  for  all  order  statistic  combiners,  up  to  10  classifiers,  for  a  Gaussian 
distribution  (Arnold,  Balaknshnan,  and  Nagaraja  1992,  Sarhan  and  Greenberg 
1956).  (Because  this  distribution  is  symmetric,  the  a  values  of  /  and  k  where 
/  +  £  =  A+  lare  identical,  and  listed  in  parenthesis). 

Returning  to  the  error  calculation,  we  have:  M°s  =  0,  and  =  cr£os, 
providing: 


pos  _ 

^  model 


sMf 

~2 


sacri 


b m 


=  aEmoder 


(6.15) 


Equation  6.15  shows  that  the  reduction  in  the  error  due  to  using  the  OS 
combiner  instead  of  the  mth  classifier  is  direcdy  related  to  the  reduction  in 
the  variance  of  the  boundary  offset  b .  Since  the  means  and  variances  of  order 
statistics  for  a  variety  of  distributions  are  widely  available  in  tabular  form,  the 
reductions  can  be  readily  quantified. 


3 Recall  that  the  pattern  is  ultimately  assigned  to  the  class  with  the  highest  combined  output. 
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N 

k 

a 

N 

k 

a 

N 

k 

Of 

1 . . 

1 

1.00 

6 

2(5) 

.280 

1(9) 

.357 

2 

1(2) 

.682 

3(4) 

.246 

2(8) 

.226 

3 

1(3) 

.560 

1(7) 

.392 

9 

3(7) 

.186 

2 

.449 

7 

2(6) 

.257 

4(6) 

.171 

4 

1(4) 

.492 

3(5) 

.220 

5 

.166 

2(3) 

.360 

4 

.210 

1(10) 

.344 

1(5) 

.448 

1(8) 

.373 

2(9) 

.215 

5 

2(4) 

.312 

8 

2(7) 

.239 

10 

3(8) 

.175 

3 

.287 

3(6) 

.201 

4(7) 

.158 

~6~ 

1(6) 

.416 

4(5) 

.187 

5(6) 

.151 

Table  6.1:  Reduction  factors  a  for  the  Gaussian  distribution,  based  on  Sarhan 
and  Greenberg  (1956). 


6.4.3  Combining  Biased  Classifiers 
through  Order  Statistics 

In  this  section,  we  analyze  the  error  regions  for  biased  classifiers.  Let  us  return 
our  attention  to  bos .  First,  note  that  the  error  terms  can  no  longer  be  studied 
separately,  since  in  general  ( a  +  b)os  ^  aos  +  bos.  We  will  therefore  need  to 
specify  the  mean  and  variance  of  the  result  of  each  operation4.  Equation  6.13 
becomes: 

bos  =  Wi  +  riiixbW'-ifij  +  rijixb))0' '  (6  16) 


Let  fa  =  jj  l  W  he  the  mean  °f  classifier  biases.  Since  s  have 
zero-mean,  fa  +  Wc(xb)  has  first  moment  fa  and  variance  o2m  +  aL,  with 

"jfe  pk 

cfpm  —  E[(fi™)2]  —  fa 2,  where  [■]  denotes  the  expected  value  operator. 

Taking  a  specific  order  statistic  of  this  expression  will  modify  both  mo¬ 
ments.  The  first  moment  is  given  by  fa  +  /mos,  where  fios  is  a  shift  which 
depends  on  the  order  statistic  chosen,  but  not  on  the  class.  Then,  the  first 
moment  of  bos  is  given  by: 


(A  +  /xos)  -  A  +  vos)  __  fii-fij 


=  A 


(6.17) 


s  s 

Note  that  the  bias  term  represents  an  “average  bias”  since  the  contributions 

due  to  the  order  statistic  are  removed.  Therefore,  reductions  in  bias  cannot  be 

obtained  from  a  table  similar  to  table  6.1. 

Now,  let  us  turn  our  attention  to  the  variance.  Since  +  rf£(xb)  has 

variance  a2m  -faL ,  it  follows  that  (fa+r]k(xb))os  has  variance  o20S  =  oi(cr2m  + 
*7 k  Pk  ”k  " k 


4Since  the  exact  distribution  parameters  of  bos  are  not  known,  we  use  the  sample  mean  and 
the  sample  variance. 
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where  a  is  the  factor  discussed  in  section  6.4.2.  Therefore,  the  variance 
of  bos  is  given  by: 


_2 

ab°s 


^4 


&(Plirn  “b  ^m), 


(6.18) 


2  ~t<T0rP- 

where  cr^m  =  *  ^  *  is  the  variance  introduced  by  the  systematic  errors  of 

different  classifiers. 

We  have  now  obtained  the  first  and  second  moments  of  bos ,  and  can  com¬ 
pute  the  model  error.  Namely,  we  have  M°s  =  ft  and  a^s  =  M ™ 
leading  to: 


E°model(P)  =  \m°2s  =  +  fi2)  (6.19) 

=  S-{a(o2bm  +  o}m)  +  p1).  (6.20) 

The  reduction  in  the  error  is  more  difficult  to  assess  in  this  case.  By  writing 
the  error  as: 

Emodei(fi)  =  +  OS'")2)  +  |(aa^  +  /S2  -  a(/3m)2), 

we  get: 


Emodel(P)  =  <*EmodelW)  +  ^aoj  +  P  ~  «(/Sm)2).  (6.21) 

Analyzing  the  error  reduction  in  the  general  case  requires  knowledge  about 
the  bias  introduced  by  each  classifier.  Unlike  regression  problems  where  the 
bias  and  variance  contributions  to  the  error  are  additive  and  well-understood, 
in  classification  problems  their  interaction  is  more  complex  (Friedman  1997). 
Indeed  it  has  been  observed  that  ensemble  methods  do  more  than  simply  re¬ 
duce  the  variance  (Schapire,  Freund,  and  Bartlett  1997). 

Based  on  these  observations  and  equation  6.21,  let  us  analyze  extreme 
cases.  For  example,  if  each  classifier  has  the  same  bias,  is  reduced  to  zero 
and  ft  =  fim.  In  this  case  the  error  reduction  can  be  expressed  as: 

EZde0)  =  ^k2 + on2  =  ctE*odel{fi) + s{l~a\r)\ 

where  a  balances  the  two  contributions  to  the  error.  A  small  value  for  a  will 
reduce  the  first  component  of  the  error  (mainly  variance),  while  leaving  the 
second  term  untouched.  The  net  effect  will  be  very  similar  to  results  obtained 
for  regression  problems.  In  this  case,  it  is  important  to  reduce  classifier  bias 
before  combining  (e.g.,  by  using  an  overparametrized  model). 
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If  on  the  other  hand,  the  biases  produce  a  zero  mean  variable,  we  obtain 
4  =  0.  In  this  case,  the  model  error  becomes: 

EZdel(P)  =  «EZodelW  +  ~ 

and  the  error  reduction  will  be  significant  if  the  second  term  is  small  or  neg¬ 
ative.  In  fact,  if  the  variation  among  the  biases  is  small  relative  to  their  mag¬ 
nitude,  the  error  will  be  reduced  more  than  in  the  unbiased  cases.  If  however, 
the  variation  is  large  compared  to  the  magnitude,  the  error  reduction  will  be 
minimal.  Furthermore,  if  a  is  large  and  the  biases  are  small  and  highly  var¬ 
ied,  it  is  possible  for  this  combiner  to  do  worse  than  the  individual  classifiers, 
which  is  a  danger  not  present  for  regression  problems.  This  observation  very 
closely  parallels  results  reported  in  Friedman  (1997). 


6.5  Linear  Combining  of  Ordered 
Classifier  Outputs 

In  the  previous  section,  we  derived  error  reductions  when  the  class  posteri¬ 
ors  are  directly  estimated  through  the  ordered  classifier  outputs.  Since  simple 
averaging  has  also  been  shown  to  provide  benefits,  in  this  section,  we  inves¬ 
tigate  the  combinations  of  averaging  and  order  statistics  for  pooling  classifier 
outputs. 


6.5.1  Spread  Combiner 

The  first  linear  combination  of  ordered  classifier  outputs  we  study  focuses  on 
extrema.  As  discussed  in  section  6.4.2,  the  maximum  and  minimum  of  a  set 
of  classifier  outputs  carry  specific  meanings.  Indeed,  the  maximum  can  be 
viewed  as  the  class  for  which  there  is  the  most  evidence.  Similarly,  the  min¬ 
imum  deletes  classes  with  little  evidence.  In  order  to  avoid  a  single  classifier 
from  having  too  large  of  an  impact  on  the  eventual  output,  these  two  values 
can  be  averaged  to  yield  the  spread  combiner.  This  combiner  strikes  a  balance 
between  the  positive  and  negative  evidence,  leading  to  a  more  robust  combiner 
than  either  of  them. 

Spread  Combiner  for  Unbiased  Classifiers: 

For  a  classifier  without  bias,  the  spread  combiner  is  formally  defined  as: 
f-pr(x)  =  l(/i1:JV(x)  +  fiN:N(x ))  =  p(cj\x)  +  rfpr{x), 


(6.22) 
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where: 


The  variance  of  ^^(x)  is  given  by: 


=  +  $cov(nl:Jf(x),  n?:N(x)).  (6.23) 

where  cou(-,  •)  represents  the  covariance  between  two  variables  (even  when 
the  rji  ‘s  are  independent,  ordering  introduces  correlations).  Note  that  because 
of  the  ordering,  the  variances  in  the  first  two  terms  of  equation  6.23  can  be 
expressed  in  terms  of  the  individual  classifier  variances.  Furthermore,  the  co- 
variance  between  two  order  statistics  can  also  be  determined  in  tabulated  form 
for  given  distributions.  Table  6.2  provides  these  values  for  a  Gaussian  distri¬ 
bution  based  on  Sarhan  and  Greenberg  (1956).  This  expression  can  be  further 
simplified  for  symmetric  distributions  where  a\N  =  a2N.N  (e.g.,  Gaussian 
noise  model)  and  leads  to: 

GlT  =  ^(“l:W  +  Sl  'N:N)<rlM,  (6.24) 

where  am: n  is  the  variance  of  the  mth  ordered  sample  and  Bm  j  n  is  the  covari¬ 
ance  between  the  mth  and  /th  ordered  samples,  given  that  the  initial  samples 
had  unit  variance  (Sarhan  and  Greenberg  1956).  Because  this  is  a  symmetric 
distribution,  the  £  values  are  also  symmetric  (e.g.,  £1,2:5  =  £4,5:5). 

Then,  using  equation  6.3,  the  variance  of  the  boundary  offset  bspr  can  be 
calculated: 


J2 

Gb*Pr 


2  2 
Vjj.spr  +  Vq.spr 

2  +  BitN:N)cr%. 


(6.25) 


Finally,  through  equation  6.6,  we  can  obtain  the  reduction  in  the  model  error 
due  to  the  spread  combiner: 


E model  _  al:N  + 
E model  2 


(6.26) 


Based  on  equation  6.26  and  tables  6.1  and  6.2,  table  6.3  displays  the  error 
reductions  provided  by  the  spread  combiner  for  a  Gaussian  noise  model  (for 
comparison  purposes,  the  error  reduction  for  the  min  and  max  combiners  is 
also  provided.  Note  that  for  the  Gaussian  distribution,  the  error  reduction  of 
min  is  equal  to  that  of  max.). 
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Table  6.2:  Some  Reduction  Factors  B  for  the  Gaussian  Distribution,  based  on 
Sarhan  and  Greenberg  (1956). 


Spread  Combiner  for  Biased  Classifiers: 


Now,  if  the  classifier  biases  are  nonzero,  the  spread  combiner’s  output  is  given 
by: 


1 


f,sprM  =  +  /?■"(*)) 

=  pic,  \x)  +  (m  (x)  +  Pi)spr . 

In  that  case,  the  boundary  offset  is  given  by: 

bsPr  =  (A  +  m(xb)ypr  -  (Pj  +  Tfj  (xb))SPr 
s 

which  after  expanding  each  term  and  regrouping  can  be  expressed  as: 
(Pi  +  m(xb))l:N -(fij  +  rij(xb))UN 


cN:N  / 


(6.27) 


(6.28) 


bs?r  = 


2s 

\N:N 


N:N 


(fc  +  m(xb))N:N -(Pj  +  Vj(xb))N: 
2s 


(6.29) 


The  first  moment  of  bspr  can  be  obtained  by  analyzing  each  term  of  equa¬ 
tion  6.29.  In  fact,  the  offset  introduced  by  the  first  and  nth  order  statistic  for 
classes  i  and  j  will  cancel  each  other  out,  leaving  only  the  average  bias  be¬ 
tween  the  min  and  max  components  of  the  error  (as  in  equation  6.17),  given 

ay.N_a\:N+pN:N^pH-.N 

by  fispr  =  - _ /  1 _ th _ 
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N 

spread 

min  or  max 

2 

.500 

.682 

3 

.362 

.560 

4 

.299 

.492 

5 

.261 

.448 

6 

.236 

.416 

7 

.219 

.392 

8 

.205 

.373 

9 

.194 

.357 

10 

.186 

.344 

Table  6.3:  EiTor  reduction  factors  for  the  spread,  min  and  max  combiners  with 
Gaussian  noise  model. 


The  variance  of  bspr  needs  to  be  derived  from  equation  6.29.  Proceeding 
as  in  equation  6.18,  the  variance  of  the  spread  combiner  can  be  expressed  as: 

2  A  1  1  _  9? 

CTbspr  =  (~ai:AT  +  -OtN:N  +  +  GTpm).  (6.30) 

For  a  symmetric  distribution  (where  ot\: m  =  &N\N )»  we  obtain  the  following 
error: 

E model  (P)  =  ^2  =  ~  (Cr&pr  +  M  [ 2) 

=  2  2  B\,N'.N)(abm  +  &pm)  “h  (fiSpr)^ 

1 

~  2^aiN  +  Bl,N:N)Emodel(P)  + 

J(«l +  BUN:N)(ajm  -  O0m)2)  +  ^(^r)2,  (6.31) 

which  is  very  similar  to  equation  6.21,  where  the  value  of  ot  for  a  single  order 
statistic  is  now  replaced  by  a  since  the  mean  of  the  first  and  /zth 

order  statistic  is  used  in  the  posterior  estimate. 


6.5.2  Trimmed  Means 

Instead  of  actively  using  the  extreme  values  as  was  the  case  with  the  spread 
combiner,  one  can  base  the  posterior  estimate  around  the  median  values.  How¬ 
ever,  instead  of  selecting  one  classifier  output  as  was  done  for  fmedy  one  can 
use  multiple  classifiers  whose  outputs  are  “typical.”  In  this  scheme,  only  a  cer¬ 
tain  fraction  of  all  available  classifiers  are  used  for  a  given  pattern.  The  main 
advantage  of  this  method  over  weighted  averaging  is  that  the  set  of  classifiers 
which  contribute  to  the  combiner  vary  from  pattern  to  pattern.  Furthermore, 
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N 

ave  (for  N) 

trim  (for  N\  =  2  ;  yv2  =  N  —  1) 

ave  (for  N  —  2) 

3 

.333 

.449 

1.00 

4 

.250 

.298 

.500 

5 

.200 

.227 

.333 

6 

.167 

.184 

.250 

7 

.143 

.155 

.200 

8 

.125 

.134 

.167 

9 

.111 

.113 

.143 

Table  6.4:  Error  Reduction  factors  for  trim  and  two  corresponding  ave  com¬ 
biners  with  Gaussian  noise  model. 


where  am: n  is  the  variance  of  the  mth  ordered  sample  and  Bm  j:^  is  the  covari¬ 
ance  between  the  mth  and  Ith  ordered  samples,  given  that  the  initial  samples 
had  unit  variance  (Sarhan  and  Greenberg  1956).  Using  the  theory  highlighted 
in  section  6.3,  and  equation  6.34,  we  obtain  the  following  model  error  reduc¬ 
tion: 


ptrim 
n model 

Emodel 


1 


(N2  -Ni  +  1)2 


Ni  Ni  \ 

T.  +  2  ^2  )  •  (6-35) 

m=N i  m—N\  l>m  ) 


Based  on  equation  6.35  and  tables  6. 1  and  6.2,  we  have  generated  a  sample 
trim  combiner  reduction  table.  Because  there  are  many  possibilities  for  N\ 
and  iV2,  a  table  that  exhaustively  provides  all  reduction  values  is  not  practical. 
In  this  sample  table  we  have  selected  N\  =  2  and  A2  =  N  —  1,  that  is,  aver¬ 
aging  after  the  lowest  and  highest  values  have  been  removed.  For  comparison 
purposes  the  reduction  factors  of  the  averaging  combiner  for  N  and  N—  2  clas¬ 
sifiers  are  also  provided  (for  i.i.d.  classifiers  the  reduction  factors  are  1/N  as 
derived  in  Turner  and  Ghosh  [  1 996a] ;  similar  results  were  obtained  for  regres¬ 
sion  problems  [Perrone  and  Cooper  1993]).  As  these  numbers  demonstrate, 
although  N  —  2  classifiers  are  used  in  the  trim  combiner,  selectively  weeding 
out  undesirable  classifiers  provides  reduction  factors  significantly  better  than 
simply  averaging  N  —  2  arbitrary  classifiers.  The  trim  combiner  provides  re¬ 
duction  factors  comparable  the  the  N  classifier  ave  combiner  without  being 
susceptible  to  corruption  by  one  particularly  faulty  classifier. 


IHmmed  mean  Combiner  for  Biased  Classifiers: 

Now,  if  the  classifier  biases  are  nonzero,  the  trimmed  mean  combiner’s  output 
is  given  by: 


1  m—N  i 


=  P(ci\x)  +  (ni(x)  +  piyrim. 


(6.36) 
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In  that  case  the  boundary  offset  is  given  by: 


fotrim  _ 


(Pi  +  rn(xb))trim  -  (Pj  +  r,j(xbyrm 

s 


(6.37) 


The  first  moment  of  blr‘m  can  be  obtained  from  a  manner  similar  to  that  of 
the  spread  combiner.  Indeed,  each  mean  offset  introduced  by  a  specific  order 
statistic  for  class  i  will  be  offset  by  the  one  introduced  for  class  j .  Only  the 
trimmed  mean  of  the  biases  will  remain,  giving  the  first  moment  of  btr  m : 


ptrim 


1 

N2-N1  +  1 


m=N  1 


(6.38) 


In  deriving  the  variance  of  btrim ,  we  follow  the  same  steps  as  in  sec¬ 
tions  6.4.3  and  6.5.1.  The  resulting  boundary  variance  is  similar  to  equa¬ 
tion  6.18,  but  the  since  the  reduction  is  due  to  the  linear  combination  of  multi¬ 
ple  ordered  outputs,  a  is  replaced  by  A,  where: 


A  = 


(N2  -  Nt  +  1)2  \  Jz 


n2  n2  \ 

£  am,N+2  £  £>„,,/:*  •  (6.39) 

i=Vi  m=N  1  l>m  f 


The  model  error  reduction  in  this  case  is  given  by: 


=  ^M2=S-(4rim  +  Ml2) 

=  —  (A(cr^m  +  ojm)  +  (pspr)2^ 

=  AEmodel(P)  +  S-(A(ajm  -  (D2)  +  (psn2)-  (6.40) 

Once  again  we  need  to  look  at  the  interaction  between  the  two  parts  of  the  error 
reduction.  The  first  term  provides  the  error  reduction  compared  to  the  model 
error  of  an  individual  classifier.  The  smaller  A  is,  the  more  error  reduction 
there  will  be.  In  the  second  term,  on  the  other  hand,  a  small  value  for  A  is 
only  useful  if  the  variability  in  the  individual  biases  is  higher  than  the  biases 
themselves  >  (£m)2). 


6.6  Experimental  Results 

The  order  statistics-based  combining  methods  proposed  in  this  article  are  tai¬ 
lored  for  situations  where  one  or  more  of  the  following  apply:  (1)  individual 
classifier  performance  is  uneven  and  class  dependent;  (2)  it  is  not  possible 
(insufficient  data,  high  amount  of  noise)  to  fine  tune  the  individual  classifiers 
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without  using  computationally  expensive  methods;  and  (3)  all  the  features  may 
not  be  available  to  all  the  classifiers. 

Such  situations  occur,  for  example,  in  electrical  logging  while  drilling  for 
oil,  where  data  from  certain  well  sites  almost  completely  misses  out  on  por¬ 
tions  of  the  problem  space,  and  in  imaging  from  airborne  platforms  where  the 
classifiers  receive  inputs  from  different  satellites  and/or  different  types  of  sen¬ 
sors  (e.g.,  thermal,  optical,  SAR).  In  this  article  we  restrict  ourselves  to  public 
domain  datasets  and  simulate  such  variability  in  two  ways,  namely,  by  (1) 
segmenting  the  feature  set  and  allowing  individual  classifiers  to  have  access 
to  only  a  limited  portion  of  the  feature  set;  and  (2)  using  “early  stopping”  i.e., 
prematurely  terminating  the  training  of  the  individual  classifiers.5 

For  the  experiments  reported  below,  we  used  a  multilayer  perceptron 
(MLP)  with  a  single  hidden  layer,  whose  weights  were  randomly  initialized  for 
each  run.  All  classification  results  reported  in  this  article  are  test  set  error  rates 
averaged  over  20  runs,  along  with  the  differences  in  the  mean  (standard  devi¬ 
ation  divided  by  square  root  of  the  number  of  runs).  Several  types  of  simple 
combiners  such  as  averaging,  weighted  averaging,  voting,  median,  products, 
weighted  products  (Bayesian),  using  Dempster-Schafer  theory  of  evidence, 
and  entropy-based  averaging,  have  been  proposed  in  the  literature.  However, 
on  a  wide  variety  of  data  sets,  it  has  been  observed  that  simple  averaging  usu- 
ally  provides  results  comparable  to  any  of  these  techniques  (and,  surprisingly, 
often  better  than  most  of  them)  (Ghosh,  Turner,  Beck,  and  Deuser  1992;  Turner 
and  Ghosh  1996b).  For  this  reason,  in  this  study,  we  use  the  average  combiner 
as  a  representative  of  simple  combiners,  for  comparison  purposes. 

6.6.1  Variability  through  Segmentation 

The  first  group  of  experiments  focus  on  classifiers  that  because  of  circum¬ 
stances  (e.g.,  geography)  have  access  to  only  a  part  of  the  full  feature  set. 
This  situation  fits  the  collective  data  mining  framework  wherein  a  globally 
distributed  dataset  is  vertically  partitioned  (see  chapter  5).  Unfortunately,  we 
are  not  aware  of  any  public  domain  datasets  for  collective  data  mining,  so  we 
instead  selected  three  data  sets  from  the  Probenl/UCI  benchmarks  (Prechelt 
1994).  Briefly  these  data  sets,  and  the  corresponding  size  of  the  MLP  used, 
are  :  Card:  a  51 -dimensional,  2-class  data  set  based  on  credit  approval  deci¬ 
sion  (Quinlan  1987),  with  690  patterns;  an  MLP  with  10  hidden  units;  Gene: 
a  120-dimensional  data  set  with  two  classes,  based  on  the  detection  of  splice 
junctions  in  DNA  sequences  (Noordewier,  Towell,  and  Shavlik  1991),  with 

In  all  the  experiments  reported  here,  “high  variability”  means  that  classifiers  in  an  ensemble 
were  trained  half  as  along  as  they  would  have  been,  had  they  been  stand-alone  classifiers. 

6 The  number  of  hidden  units  was  determined  experimentally. 
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Data 

Number  of  Original 
Features 

Number  of 
Segments 

Features  per  Segment 

no  Overlap 

Overlap 

Card 

51 

4 

13-13-13-12 

18-18-18-18 

Gene 

120 

4 

30-30-30-30 

40-40-40-40 

Sat 

36 

4 

9-9-9-9 

15-15-15-15 

Table  6.5:  Number  of  features  in  Probenl/UCI  data  sets 


3175  patterns;  an  MLP  with  10  hidden  units;  Satellite:  a  36-dimensional,  6- 
class  data  set  with  6435  examples  of  feature  vectors  extracted  from  satellite 
imagery;  an  MLP  with  20  hidden  units. 

These  three  sets  were  chosen  as  they  have  relatively  large  number  of  fea¬ 
tures,  somewhat  large  number  of  data  points,  and  have  been  studied  by  several 
researchers.  Also  note  that  the  Probenl  benchmarks  are  particular  training, 
validation  and  test  splits  of  the  UCI  data  sets.7  The  results  presented  in  this 
article  are  based  on  the  first  training,  validation  and  test  partition  discussed  in 
Prechelt  (1994),  where  half  the  data  is  used  for  training,  and  a  quarter  each  for 
validation  and  testing  purposes. 

We  investigate  two  situations:  one  where  the  original  features  were  ran¬ 
domly  and  disjointly  partitioned  among  the  different  segments,  and  the  sec¬ 
ond  where  there  is  some  overlap  among  features  in  different  segments.  The 
exact  segment  count  and  number  of  features  within  each  segment  is  specified 
in  table  6.5. 

For  each  data  set,  we  present  the  original  number  of  features,  the  number 
of  new  features  sets  that  result  when  the  feature  set  is  segmented  (for  Gene  we 
only  have  two  new  sets,  because  the  low  dimensionality  prevents  any  further 
segmentation),  and  the  resulting  number  of  features  in  each  segment  with  and 
without  overlap  among  the  features. 

A  classifier  trains  on  data  from  one  segment,  and  different  classifiers  op¬ 
erate  on  different  segments.  When  the  number  of  combiners  were  higher  than 
the  number  of  segments  (. N  =  8)  more  than  on  classifier  (starting  from  a  differ¬ 
ent  initialization)  was  trained  on  the  same  features. 

Tables  6.6-6.7  present  the  results  (with  the  best  result  for  each  case  in  bold 
font).  The  misclassification  percentage  for  individual  classifiers  are  reported 
in  the  first  column.  For  the  trimmed  mean  combiner,  we  also  provide  N\ 
and  A2,  the  upper  and  lower  cutting  points  in  the  ordered  average  used  in 
equation  6.32,  obtained  through  the  validation  set. 

In  this  case,  for  two  of  the  three  data  sets  (Gene  and  Card),  there  are  strik¬ 
ing  gains  due  to  using  order  statistics  combiners.  One  cause  for  these  gains  is 
the  high  variability  in  performance  among  the  component  classifiers.  In  such 
cases,  a  small  number  of  poor  classifiers  can  corrupt  the  average  combiner.  By 


7 available  from  www.ics.uci.eduT mleam/MLRepository.html 
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Data 

Ave 

Max 

H 

12.21  ±  .00 

10.58  ±  .06 

10.61  ±  .06 

10.58  ±  .00 

12.21  ±  .00  (3-4) 

BawHfrl 

■9 

12.21  ±.00 

10.47  ±.00 

10.61  ±  .06 

10.47  ±  .00 

10.47  ±  .00  (7-8) 

_  Gene 

4 

18.52  ±.10 

MIXYEWkm 

14.72  ±  .15 

16.86  ±.15  (3-4) 

8 

18.06  ±.06 

17.59  ±.17 

13.69  ±.11 

13.39  ±  .08  (7-8) 

Sat 

4 

14.16  ±.08 

14.73  ±.18 

14.64  ±.16 

14.24  ±.12 

14.00  ±  .07  (3-4) 

16.40  ±0.56 

8 

14.21  ±  .05 

15.27  ±.15 

14.49  ±.11 

14.01  ±  .04  (3-5) 

Table  6.6:  Segmented  features  with  overlap  (%  misclassified  ±.cr/«Jn). 


Data 

HH 

Ave 

Max 

Min 

Spread 

Trim  (N\-N2) 

IPV^SMI 

H 

12.21  ±  .00 

10.49  ±  .03 

10.49  ±  .03 

12.21  ±.00  (3-4) 

EEBEE1 

B 

12.21  ±  .00 

10.78  ±  .07 

10.78  ±  .07 

11.05  ±.00  (7-8) 

Gene 

4 

24.35  ±.13 

15.82  ±.15 

19.15  ±.22 

14.09  ±.11 

23.11  ±.15  (3-4) 

36.87  ±  3.01 

Efl 

23.33  ±.19 

14.99  ±.14 

16.78  ±  .24 

13.23  ±  .15 

15.03  ±  .17  (7-8) 

■WgHTjKI 

H 

14.39  ±  .09 

15.66  ±.15 

15.46  ±.11 

15.11  ±.11 

14.22  ±.07  (2-3) 

D 

14.37  ±  .05 

15.93  ±.06 

15.53  ±.06 

15.18  ±.10 

14.04  ±  .05  (3-5) 

Table  6.7:  Segmented  features  without  overlap  (%  misclassified  ±cr/</n). 


their  very  nature,  though,  combiners  based  on  order  statistics  are  immune  to 
this  type  of  corruption.  The  ave  combiner  performs  well  on  the  Sat  data  sets 
where  the  performance  among  the  individual  classifiers  is  much  more  homo¬ 
geneous.  In  this  case,  the  ave  results  are  only  marginally  worse  than  those  for 
the  trimmed  mean. 


6.6.2  Variability  through  Early  Stopping 

For  the  second  set  of  experiments  we  use  two  classes  of  acoustic  underwater 
sonar  signals8.  From  the  original  sonar  signals  of  four  different  underwater 
objects  (porpoise  sound,  cracking  ice  and  two  different  whale  sounds),  two 
feature  sets  are  extracted  (Ghosh,  Deuser,  and  Beck  1992):  (1)  WOC:  a  25- 
dimensional  feature  set,  consisting  of  Gabor  wavelet  coefficients,  temporal 
descriptors  and  spectral  measurements;  and,  (2)  RDO:  a  24-dimensional  fea¬ 
ture  set,  consisting  of  reflection  coefficients  based  on  both  short  and  long  time 
windows,  and  temporal  descriptors.  For  both  feature  sets,  an  MLP  with  50  hid¬ 
den  units  was  used.9  Further  details  about  this  4-class  problem  can  be  found 
in  Ghosh,  Deuser,  and  Beck  (1992)  and  Turner  and  Ghosh  (1996b). 

Table  6.8  presents  the  combining  results  for  the  underwater  acoustic  data 
set  when  the  individual  classifier  performance  is  highly  variable.  The  results  of 
table  6.8  as  well  as  those  given  in  Turner  and  Ghosh  (1999)  indicate  that  when 
the  individual  classifier  performance  is  highly  variable,  order  statistics-based 

8  Results  on  6  Proben/UCI  datasets  were  reported  in  Turner  and  Ghosh  (1999)  and  hence  are 
not  repeated  here 

^These  data  sets  are  available  at  www.lans.ece.utexas.edu. 
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Data 

N 

Ave 

Max 

Min 

Spread 

Trim(Vr/V2) 

RDO 

4 

11.57  ±.11 

11. 52  ±.20 

11.04  ±  .09 

11. 34  ±.14  (3-4) 

13.32  ±0.83 

8 

11.64  ±.09 

11.29  ±  .13 

11.51  ±.09 

12.30  ±  .08  (4-5) 

woe 

4 

8.80  ±.09 

7.84  ±  .10 

9.31  ±.12 

8.54  ±.06 

8.43  ±  .13  (3-4) 

12.07  ±  1.12 

8 

8.82  ±  .08  ! 

7.68  ±  .12 

8.91  ±  .06 

8.24  ±.11 

7.81  ±  .08  (7-8) 

Table  6.8:  Combining  results  in  the  presence  of  high  variability  in  individual 
classifier  performance  for  the  sonar  data  (%  misclassified  ±o/«Jn). 


combiners  (particularly  the  spread  combiner)  typically  provide  better  classi¬ 
fication  results  than  other  simple  combiners.  This  performance  improvement 
is  obtained  without  sacrificing  the  simplicity  of  the  combiner.  One  important 
thing  to  note,  however,  is  that  in  all  cases  studied,  the  order  statistics  based 
combiners  performed  at  least  as  well  as  the  simple  combiner,  implying  that  no 
risk  is  taken  by  using  this  method. 

A  close  inspection  of  these  results  reveals  that  using  either  the  max  or  min 
combiner  can  provide  better  classification  rates  than  ave ,  but  it  is  difficult  to 
determine  which  of  the  two  will  be  more  successful  given  a  data  set.  A  valida¬ 
tion  set  may  be  used  to  select  one  over  the  other,  but  in  that  case,  potentially 
precious  training  data  is  used  solely  for  determining  which  combiner  to  use. 
The  use  of  the  spread  combiner  removes  this  dilemma  by  consistently  pro¬ 
viding  results  that  are  comparable  to,  or  better  than,  the  best  of  the  max-min 
duo. 


6.7  Concluding  Remarks 

In  this  article  we  present  and  analyze  combiners  based  on  order  statistics. 
These  combiners  blend  the  simplicity  of  averaging  with  the  generality  of  met- 
aleamers.  They  are  particularly  effective  if  there  are  significant  variations 
among  component  classifiers  in  at  least  some  parts  of  the  joint  input-output 
space.  Variations  can  arise  when  the  individual  training  sets  cannot  be  consid¬ 
ered  as  random  samples  from  a  common  universal  data  set.  Examples  of  such 
cases  include  real-time  data  acquisition  and  classification  from  geographically 
distributed  sources  or  data  mining  problems  with  large  and  possibly  heteroge¬ 
nous  databases,  where  random  subsampling  is  computationally  expensive  and 
practical  methods  lead  to  nonrandom  subsamples  (Bradley  and  Fayyad,  1998). 
The  robustness  of  order  statistics  combiners  is  also  helpful  when  certain  indi¬ 
vidual  classifiers  experience  catastrophic  failures  (e.g.,  due  to  faulty  sensors). 

The  analytical  framework  provided  in  this  chapter  quantifies  the  reductions 
in  error  achieved  when  an  order  statistics  based  ensemble  is  used.  It  also  shows 
that  the  two  methods  for  linear  combination  of  order  statistics  introduced  in 
this  chapter  provide  more  reliable  estimates  of  the  true  posteriors  than  any  of 
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the  individual  order  statistic  combiners. 

The  experimental  results  of  section  6.5  indicate  that  when  there  is  sig¬ 
nificant  variability  among  the  classifiers,  the  order  statistics-based  combiners 
substantially  outperform  simple  combiners.  Our  previous  results  also  showed 
that  in  the  absence  of  such  variability  these  combiners  perform  no  worse.  Thus 
the  family  of  order  statistic  combiners  are  applicable  over  a  wide  range  of  situ¬ 
ations.  They  are  able  to  extract  an  appropriate  amount  of  information  from  the 
individual  classifier  outputs  without  requiring  tuning  additional  parameters  as 
in  metaleamers,  and  without  being  substantially  affected  by  outliers. 

A  future  endeavor,  which  will  be  helpful  for  this  work  as  well  as  for  the 
study  of  collective  data  mining  on  very  large  datasets  in  general,  is  to  ob¬ 
tain  a  suite  of  public  domain  datasets  which  are  intrinsically  partitioned  into 
segments  with  varying  quality.  Though  such  situations  sometimes  occur  in 
practice  (for  example  in  oil  logging  data  (Chakravarthy  1997)  and  mortgage 
scoring  (Merz  1998);  both  data  sets  proprietary),  they  are  not  represented  in 
the  standard,  venerable  databases  such  as  UCI,  ELENA  and  Statlog  typically 
used  by  the  academic  community.  Perhaps  the  recent  CRoss-Industry  Standard 
Process  for  Data  Mining  (CRISP- DM)  initiative  will  provide  a  satisfactory  so¬ 
lution  to  this  problem  in  the  near  future. 
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AUTHOR:  AAAI  PRESS  FOLLOWS  A  ’’DOWN”  STYLE  FOR  PARTS  OF 
THE  BOOK  AND  MOST  TITLES  (UNLESS  A  PROPER  NAME),  (figure, 
table,  chapter,  etc.).  ALSO  NOTE  THAT  CHICAGO  MANUAL  OF  STYLE 
AND  AAAI  PRESS  STYLE  CALL  FOR  ALMOST  ALL  PREFIXES  TO  BE 
SPELLED  SOLID  (NO  HYPHEN)  EVEN  WHEN  USED  AS  ADJECTIVES. 
ALSO,  MOST  LISTS  HAVE  BEEN  CONVERTED  TO  PARAGRAPH  STYLE 
TO  CONSERVE  SPACE  (BOOK  IS  TOO  LONG!). 


